Building a Production Model for Retrieval-Based Chatbots

Original Paper

Introduction

Chatbot dialogue systems:

  • Generative approach (not the focus of this paper).
  • Retrieval approach: offers better control over response quality than generative approaches; selects outputs from a whitelist of candidate responses.

Challenges:

  • Inference speed.
  • The whitelist selection process and the associated retrieval evaluation; the commonly used recall metric is over-simplified.

This paper uses SRU instead of LSTM; SRU is faster in both training and inference.

3. Model Architecture

2 inputs:

  • Context c: the concatenation of all utterances in the conversation. Special tokens indicate whether each utterance comes from the customer or the agent.
  • Candidate response r.

1 output:

A score s(c, r) indicating the relevance of the response to the context.
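As a small illustration of how the two inputs could look in practice, here is a minimal sketch that builds the context string from a conversation; the speaker-token strings and the helper name are illustrative assumptions, not the paper's exact preprocessing.

```python
# Illustrative only: the "<customer>"/"<agent>" token strings are assumptions.
def build_context(utterances):
    """Concatenate all utterances, prefixing each with a speaker token."""
    parts = []
    for speaker, text in utterances:
        tag = "<customer>" if speaker == "customer" else "<agent>"
        parts.append(f"{tag} {text}")
    return " ".join(parts)

context = build_context([
    ("customer", "my tracking number does not work"),
    ("agent", "let me look into that for you"),
    ("customer", "thanks"),
])
candidate_response = "Could you confirm the order ID on your receipt?"
```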

(Figure: model architecture)

3.1 Dual encoder

The core of the model is a pair of neural encoders fc and fr that encode the context and the response, respectively. They have identical architectures but separate weights.

Input to each encoder: a sequence of words w = w1, w2, ..., wn, either a context or a response.

fastText is used as the word embedding method due to the prevalence of typos in both user and agent utterances in real chats.
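A brief illustration of why fastText's subword n-grams help with noisy chat text, using the fastText Python bindings; the pre-trained model file name is an assumption.

```python
# Illustration only: the model file name is an assumption; any pre-trained
# fastText model works. Subword n-grams let typos and out-of-vocabulary
# words still receive reasonable embeddings.
import fasttext

ft = fasttext.load_model("cc.en.300.bin")
v_clean = ft.get_word_vector("tracking")
v_typo = ft.get_word_vector("trackign")   # misspelling still gets a nearby vector
```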

Each encoder consists of 1 recurrent neural network and 1 multi-headed attention layer.

Recurrent neural network: a multi-layer, bidirectional SRU. Each SRU layer involves the following computation:

$$\begin{aligned} f_t &= \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \\ c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \\ r_t &= \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \\ h_t &= r_t \odot c_t + (1 - r_t) \odot x_t \end{aligned}$$

where σ is the sigmoid activation function, W, W_f, W_r ∈ R^{d_h × d_e} are learned parameter matrices, and v_f, v_r, b_f, b_r ∈ R^{d_h} are learned parameters.
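A minimal NumPy sketch of a single unidirectional SRU layer following the equations above (bidirectionality, stacking, and initialization are omitted; the highway connection assumes d_e = d_h):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_layer(x, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """One unidirectional SRU pass. x: (n, d) input sequence; the highway
    connection h_t = r_t*c_t + (1-r_t)*x_t assumes d_e == d_h here."""
    n, d_h = x.shape[0], W.shape[0]
    c_prev = np.zeros(d_h)
    h = np.zeros((n, d_h))
    for t in range(n):
        f = sigmoid(W_f @ x[t] + v_f * c_prev + b_f)   # forget gate
        c = f * c_prev + (1.0 - f) * (W @ x[t])        # recurrent state
        r = sigmoid(W_r @ x[t] + v_r * c_prev + b_r)   # reset gate
        h[t] = r * c + (1.0 - r) * x[t]                # highway output
        c_prev = c
    return h
```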

The multi-headed attention layer compresses the encoded sequence h = h1, h2, ..., hn into a single vector. For each attention head i, attention weights are generated with the following computation:


$$\alpha^{(i)} = \mathrm{softmax}\left(\sigma\left(h^\top W_a^{(i)}\right) v_a^{(i)}\right)$$

where σ is a non-linear activation function, W_a^{(i)} ∈ R^{d_h × d_a} is a learned parameter matrix and v_a^{(i)} ∈ R^{d_a} is a learned parameter vector.

The encoded sequence representation is then pooled into a single vector for each attention head i by summing the attended representations:

$$\hat{h}^{(i)} = \sum_{j=1}^{n} \alpha^{(i)}_jh_j$$

Finally, the pooled encodings are averaged across the n_h attention heads:

$$\hat{h} = \frac{1}{n_h} \sum_{i=1}^{n_h} \hat{h}^{(i)}$$

The output of the encoder function is the vector f(w) = ĥ.
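A sketch of the attention pooling step, assuming tanh for the non-linearity σ and taking the per-head parameters as explicit arguments:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(h, W_a_heads, v_a_heads):
    """h: (n, d_h) encoded sequence; W_a_heads: list of (d_h, d_a) matrices;
    v_a_heads: list of (d_a,) vectors. Returns the averaged pooled vector."""
    pooled = []
    for W_a, v_a in zip(W_a_heads, v_a_heads):
        scores = np.tanh(h @ W_a) @ v_a   # (n,) attention logits for this head
        alpha = softmax(scores)           # attention weights over positions
        pooled.append(alpha @ h)          # weighted sum over the sequence
    return np.mean(pooled, axis=0)        # average across the n_h heads
```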

3.2 Scoring

To determine the relevance of a response r to a context c, the model computes a matching score between the context encoding fc(c) and the response encoding fr(r). This score is simply the dot product of the encodings:

$$s(c, r) = f_c(c) \cdot f_r(r)$$
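Because the score is a plain dot product, the whitelist responses can be encoded once ahead of time, and scoring a new context reduces to a single matrix-vector product. A small sketch with placeholder encoders:

```python
import numpy as np

# Placeholder encoders standing in for the trained f_c and f_r networks.
def f_c(context):  return np.random.rand(128)
def f_r(response): return np.random.rand(128)

whitelist = ["Sure, I can help with that.", "Could you share your order ID?"]
whitelist_matrix = np.stack([f_r(r) for r in whitelist])   # pre-encode once

def best_response(context):
    scores = whitelist_matrix @ f_c(context)   # s(c, r) for every whitelist r
    return whitelist[int(np.argmax(scores))]
```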

3.3 Training

We optimize the model to maximize the score between the context c and the response r+ actually sent by the agent, while minimizing the score between the context and each of k random "negative" responses r1, ..., rk. Although negative responses could be sampled separately for each context-response pair, we instead share a single set of negative responses across all examples in a batch. This has the benefit of reducing the number of responses that need to be encoded in each batch of size b from O(kb) to O(k + b).
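A sketch of this batched objective with shared negatives; the choice of softmax cross-entropy over the positive and the k shared negatives is an assumption about the exact loss:

```python
import torch
import torch.nn.functional as F

def batch_loss(ctx_enc, pos_enc, neg_enc):
    """ctx_enc: (b, d) context encodings, pos_enc: (b, d) encodings of the
    true responses, neg_enc: (k, d) encodings of the k shared negatives."""
    pos_scores = (ctx_enc * pos_enc).sum(dim=1, keepdim=True)  # (b, 1)
    neg_scores = ctx_enc @ neg_enc.t()                         # (b, k), negatives shared by all rows
    logits = torch.cat([pos_scores, neg_scores], dim=1)        # (b, 1 + k)
    labels = torch.zeros(logits.size(0), dtype=torch.long)     # positive is index 0
    return F.cross_entropy(logits, labels)
```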

3.5 Whitelist Generation

2 methods of creating a whitelist from which the model selects responses at inference time:

  • Frequency-based method: select the 1,000 or 10,000 most common agent responses.
  • Clustering-based method: encode all agent responses using the response encoder fr and use k-means clustering with k = 1,000 or k = 10,000 to cluster the responses. Then select the most common response from each cluster to create the whitelist (see the sketch after this list).
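A sketch of both whitelist construction strategies, with f_r standing in for the trained response encoder and scikit-learn's KMeans used for the clustering step:

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def frequency_whitelist(agent_responses, size=1000):
    """Most common agent responses by raw frequency."""
    return [resp for resp, _ in Counter(agent_responses).most_common(size)]

def clustering_whitelist(agent_responses, f_r, size=1000):
    """Cluster response encodings, then keep the most common response per cluster."""
    unique = list(set(agent_responses))
    encodings = np.stack([f_r(r) for r in unique])
    labels = KMeans(n_clusters=size).fit_predict(encodings)
    counts = Counter(agent_responses)
    whitelist = []
    for cluster in range(size):
        members = [r for r, l in zip(unique, labels) if l == cluster]
        whitelist.append(max(members, key=lambda r: counts[r]))
    return whitelist
```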

Resource